Similar n-gram language model
Authors
Abstract
This paper describes an extension of the n-gram language model: the similar n-gram language model. The classical model of order n estimates the probability P(s) of a string s from occurrence statistics of the last n words of the string in the corpus, whereas the proposed model additionally uses all strings s′ whose Levenshtein distance to s is smaller than a given threshold. The similarity between s and each string s′ is estimated from co-occurrence statistics. The new P(s) is approximated by smoothing all the similar n-gram probabilities with a regression technique. A slight but statistically significant decrease in the word error rate is obtained on a state-of-the-art automatic speech recognition system when the similar n-gram language model is interpolated linearly with the n-gram model.
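The core idea can be sketched in a few lines. The following is a minimal illustrative toy, not the authors' implementation: it replaces the paper's co-occurrence-based similarity weighting and regression smoothing with a plain average over n-grams within the Levenshtein threshold, and then interpolates linearly with the baseline n-gram estimate. All names, parameters, and the toy corpus are assumptions.

```python
# Toy sketch of a "similar n-gram" estimate (illustrative assumptions,
# not the paper's exact method).
from collections import Counter

def levenshtein(a, b):
    """Edit distance between two sequences (works on strings or tuples)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (x != y)))   # substitution
        prev = cur
    return prev[-1]

class SimilarNgramLM:
    def __init__(self, tokens, n=2, threshold=1, lam=0.7):
        self.threshold, self.lam = threshold, lam
        self.counts = Counter(tuple(tokens[i:i + n])
                              for i in range(len(tokens) - n + 1))
        self.total = sum(self.counts.values())

    def p_ngram(self, ngram):
        # Plain maximum-likelihood n-gram probability (no smoothing).
        return self.counts[tuple(ngram)] / self.total

    def p_similar(self, ngram):
        # Average over all stored n-grams within the Levenshtein
        # threshold; a crude stand-in for the paper's similarity weights.
        near = [g for g in self.counts
                if levenshtein(g, tuple(ngram)) <= self.threshold]
        if not near:
            return 0.0
        return sum(self.p_ngram(g) for g in near) / len(near)

    def p(self, ngram):
        # Linear interpolation of the baseline and similarity estimates.
        return (self.lam * self.p_ngram(ngram)
                + (1 - self.lam) * self.p_similar(ngram))

lm = SimilarNgramLM("the cat sat on the mat the cat ran".split())
```

Here `lm.p(("the", "cat"))` mixes the direct bigram estimate with mass borrowed from neighbours such as `("the", "mat")`.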
Similar resources
Class-based language model adaptation using mixtures of word-class weights
This paper describes the use of a weighted mixture of class-based n-gram language models to perform topic adaptation. By using a fixed class n-gram history and variable word-given-class probabilities, we obtain large improvements in the performance of the class-based language model, giving it accuracy similar to a word n-gram model, and an associated small but statistically significant improvemen...
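The structure described above (a fixed class history combined with topic-weighted word-given-class probabilities) can be sketched as follows. This is a hedged illustration under assumed names and toy probabilities, not the paper's model or data:

```python
# Sketch of a mixture of class-based n-gram models: the class-bigram
# term is fixed, while the word-given-class term is a weighted mixture
# over topics. All dictionaries below are illustrative assumptions.

def class_mixture_prob(word, prev_class, word2class,
                       p_class_given_class, p_word_given_class_by_topic,
                       topic_weights):
    """P(word | history) ~ P(c(word) | c(prev)) * sum_t lam_t * P_t(word | c(word))."""
    c = word2class[word]
    p_class = p_class_given_class[(prev_class, c)]
    p_word = sum(lam * p_t[(word, c)]
                 for lam, p_t in zip(topic_weights,
                                     p_word_given_class_by_topic))
    return p_class * p_word
```

For example, with `P(NOUN | DET) = 0.5` and two topic models giving "model" probability 0.2 and 0.6 within NOUN, equal topic weights yield 0.5 × 0.4 = 0.2.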
A State-space Method for Language Modeling
In this paper, a new state-space method for language modeling is presented. The complexity of the model is controlled by choosing the dimension of the state instead of the smoothing and back-off methods common in n-gram modeling. The model complexity also controls the generalization ability of the model, allowing it to handle similar words in a similar manner. We compare the state-space mo...
Growing an n-gram language model
Traditionally, when building an n-gram model, we decide the span of the model history, collect the relevant statistics and estimate the model. The model can be pruned down to a smaller size by manipulating the statistics or the estimated model. This paper shows how an n-gram model can be built by adding suitable sets of n-grams to a unigram model until desired complexity is reached. Very high o...
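The growing procedure can be illustrated with a deliberately simplified sketch: start from unigram counts and greedily add the most frequent higher-order n-grams until a size budget is reached. The frequency-based selection criterion and all names here are assumptions for illustration; the paper's actual inclusion criterion is not reproduced.

```python
# Toy sketch of growing an n-gram model from a unigram base
# (illustrative greedy criterion, not the paper's method).
from collections import Counter

def grow_ngram_model(tokens, max_order=3, budget=4):
    model = Counter(tokens)  # start from unigram counts
    candidates = Counter()
    for n in range(2, max_order + 1):
        candidates.update(tuple(tokens[i:i + n])
                          for i in range(len(tokens) - n + 1))
    # Add the most frequent higher-order n-grams until the budget
    # of added entries is exhausted.
    for gram, count in candidates.most_common():
        if len(model) >= budget + len(set(tokens)):
            break
        model[gram] = count
    return model

model = grow_ngram_model("a b a b a c".split())
```

After growing, the model holds the unigrams plus the `budget` most useful (here: most frequent) bigrams and trigrams, e.g. `("a", "b")`.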
Segmenting DNA sequence into 'words' based on statistical language model
[Abstract] This paper presents a novel method to segment/decode DNA sequences based on an n-gram statistical language model. First, by analyzing the genomes of 12 model species, we find that the length of most DNA "words" is 12 to 15 bp. The bound on the language entropy of DNA sequences is about 1.5674 bits. After building an n-gram biological language model, we design an unsupervised 'probability approach...
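Segmenting a sequence with a word-level language model is typically done by dynamic programming: choose the segmentation maximizing the total log-probability of the words. A minimal sketch, using an assumed toy lexicon and a unigram model in place of the paper's n-gram model:

```python
# Toy dynamic-programming segmenter: best[i] is the best log-prob of
# any segmentation of seq[:i]. The word log-probabilities passed in
# are illustrative assumptions.
import math

def segment(seq, word_logprob, max_len=4):
    best = [0.0] + [-math.inf] * len(seq)
    back = [0] * (len(seq) + 1)
    for i in range(1, len(seq) + 1):
        for j in range(max(0, i - max_len), i):
            w = seq[j:i]
            if w in word_logprob and best[j] + word_logprob[w] > best[i]:
                best[i] = best[j] + word_logprob[w]
                back[i] = j
    # Trace back the best segmentation.
    words, i = [], len(seq)
    while i > 0:
        words.append(seq[back[i]:i])
        i = back[i]
    return words[::-1]
```

With an assumed lexicon `{"AT": -1.0, "G": -2.0, "ATG": -2.5, "CG": -1.0}`, the sequence `"ATGCG"` segments as `["ATG", "CG"]` (score −3.5) rather than `["AT", "G", "CG"]` (score −4.0).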
Multi Class-based n-gram Language Model for New Words Using Web Data
Out-of-vocabulary (OOV) words cause serious problems for automatic speech recognition (ASR) systems. Not only will an OOV word be misrecognized as an in-vocabulary word with similar phonetics, but the error will also cause errors in nearby words. Language models (LMs) for most open-vocabulary ASR systems treat OOV words as one entity, ignoring the linguistic information. In this paper we pres...
Publication date: 2010